
less ecoli-contigs.fa

less ecoli-scaffolds.fa

Use the “awk” command to print the length of the longest scaffold in the scaffold file.

awk '{print length}' ecoli-scaffolds.fa | sort -n | tail -n1
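The one-line awk command above counts characters per line, so it reports a sequence length only when each sequence occupies a single line of the FASTA file. If the sequences are wrapped over several lines, the lengths have to be summed per record. The following is a minimal sketch of that idea; the function name longest_seq_length and the toy FASTA records are hypothetical and serve only as an illustration.

```shell
# Hypothetical helper: print the length of the longest sequence in a FASTA file.
# Header lines (starting with ">") are skipped and wrapped sequence lines are summed.
longest_seq_length() {
  awk '
    /^>/ { if (len > max) max = len; len = 0; next }  # new record: close the previous one
         { len += length($0) }                         # accumulate sequence characters
    END  { if (len > max) max = len; print max + 0 }   # close the last record
  ' "$1"
}

# Toy input (made up for this sketch) so it runs without the assembly files.
demo_fa=$(mktemp)
printf '>s1\nACGT\nAC\n>s2\nACGTACGTAC\n>s3\nA\n' > "$demo_fa"
longest_seq_length "$demo_fa"
```

On the real output, the same function would be called as longest_seq_length ecoli-scaffolds.fa.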

3.2.2  SPAdes

SPAdes [9] is a de novo genome assembler developed primarily for assembling small bacterial genomes. Modules were later added for assembling the small genomes of other organisms, including fungi and viruses. It is not recommended for assembling large mammalian genomes. The current SPAdes version works with both Illumina and Ion Torrent reads, and it can be used for hybrid genome assembly with PacBio, Oxford Nanopore, and Sanger reads. The assembler can process several paired-end and mate-pair files at the same time. The program also provides separate modules for metagenomic data, plasmid assembly from whole-genome sequencing data, plasmid assembly from metagenomic data, transcriptome assembly from RNA-Seq data, biosynthetic gene cluster assembly with paired-end reads, viral genome assembly from RNA viral data, SARS-CoV-2 assembly, and TruSeq barcode assembly. The SPAdes assembly process includes four stages. First, de Bruijn graphs are built from overlapping k-mers generated from the reads. Second, the k-mers are adjusted to obtain accurate estimates of the distances between k-mers, using both distance histograms and paths in the assembly graphs. Third, the program constructs paired de Bruijn graphs, which are a generalization of the de Bruijn graph that incorporates mate-pair information into the graph structure [10]. Finally, contigs are constructed from the graphs.
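To make the first stage more concrete, the sketch below builds the edges of a plain de Bruijn graph from two toy reads: every distinct k-mer becomes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix. It is only an illustration of the basic graph, not of the paired de Bruijn graph or of how SPAdes implements it; the function name debruijn_edges, the reads, and the choice k = 3 are all hypothetical.

```shell
# Hypothetical helper: read one sequence per line on stdin and list the
# de Bruijn graph edges for k-mers of size $1.
debruijn_edges() {
  awk -v k="$1" '
    {
      # Slide a window of width k over the read and count each k-mer.
      for (i = 1; i + k - 1 <= length($0); i++) {
        kmer = substr($0, i, k)
        count[kmer]++
      }
    }
    END {
      # One edge per distinct k-mer: (k-1)-mer prefix -> (k-1)-mer suffix.
      for (kmer in count)
        print substr(kmer, 1, k - 1), "->", substr(kmer, 2, k - 1), "(" kmer ", seen " count[kmer] "x)"
    }
  ' | LC_ALL=C sort
}

# Two overlapping toy reads (made up for this sketch).
printf 'ACGTAC\nCGTACG\n' | debruijn_edges 3
```

Because the two reads overlap, they contribute the same four k-mers, so the graph has four edges and each one is supported twice. Following the edges from one node to the next spells out a longer sequence, which is the idea behind reading contigs from the graph.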

The SPAdes program is made up of modules written in Python. The installation instructions are available at “https://cab.spbu.ru/files/release3.15.4/manual.html”. To install the program on Linux, use the following steps:

Using the Linux terminal, first download and decompress the program archive in a local directory.

wget https://cab.spbu.ru/files/release3.15.4/SPAdes-3.15.4-Linux.tar.gz

tar -xzf SPAdes-3.15.4-Linux.tar.gz

Note that the file name or download path may change with future releases.
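Once the archive is decompressed, a quick check can confirm that the program is usable. The sketch below assumes the archive was extracted in the current directory and that it created a SPAdes-3.15.4-Linux/bin directory containing spades.py; adjust the path if a different release or location is used. The variable name SPADES_BIN is only an illustration.

```shell
# Assumed location of the extracted release (change it if the layout differs).
SPADES_BIN="$PWD/SPAdes-3.15.4-Linux/bin"

if [ -x "$SPADES_BIN/spades.py" ]; then
  export PATH="$SPADES_BIN:$PATH"   # make spades.py callable in this shell session
  spades.py --version               # print the installed version
  spades.py --test                  # run SPAdes on the small data set bundled with the release
else
  echo "spades.py not found in $SPADES_BIN; check the extracted directory name" >&2
fi
```

The PATH change lasts only for the current terminal session; it would have to be added to a shell startup file to make it permanent.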

FIGURE 3.7  Genome assembly metrics.